Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Fix infinite loop when computing parse tree for recursive nullable grammars #874

Closed
wants to merge 9 commits into from

Conversation

hudson-ai
Copy link
Collaborator

Computing the parse tree (needed for getting captures) previously caused an infinite loop if the grammar has loops of nullable rules, where a nullable rule is any rule where every symbol on the RHS is nullable.

E.g., the grammar

  A -> A C
  A -> B
  A ->
  B -> A
  C -> 'x'

has nullable rules

  A -> B
  A ->
  B -> A

and there is a loop in these rules formed by A -> B and B -> A.

For a deeper discussion on this, see section "One Last Trap" of https://loup-vaillant.fr/tutorials/earley-parsing/parser

This PR solves the issue by keeping a set of visited nodes when computing the parse tree (or getting captures) and skipping any node that has already been seen. AFAIK, the only way this can happen is when we have these nullable loops, but admittedly I still have a lot to learn about the parser (so more eyes on this would be really helpful!)

This closes #863

while stack:
item, byte_pos = stack.pop()
if (item, byte_pos) in seen:
Copy link
Collaborator Author

@hudson-ai hudson-ai Jun 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just check item

while stack:
pos, item = stack.pop()

if (pos, item) in seen:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here -- just need to check item?

loop if not handled correctly
"""

@pytest.mark.timeout(5)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in love with checking for an infinite loop by introducing a timeout, but it wasn't straight-forward to catch this otherwise. Anyone have a suggestion?

5 seconds should be more than ample to compute the parse tree...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spent some time thinking about this, and couldn't come up with anything better 🤷

@codecov-commenter
Copy link

codecov-commenter commented Jun 2, 2024

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 60.33%. Comparing base (15ff21d) to head (867adbe).
Report is 36 commits behind head on main.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #874      +/-   ##
==========================================
+ Coverage   57.02%   60.33%   +3.30%     
==========================================
  Files          63       63              
  Lines        4575     4583       +8     
==========================================
+ Hits         2609     2765     +156     
+ Misses       1966     1818     -148     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

parser.consume_byte(b"x")
captures, _ = parser.get_captures()
assert mock.call_count == 1
assert captures == {"B": b"", "A": b"x"}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick note here: I'm not sure what the semantics of partial captures are supposed to be, but this test is consistent with the existing behavior without these nullable loops

Copy link
Collaborator

@riedgar-ms riedgar-ms left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for digging into this!

# Test that computing the parse tree doesn't hang
parser.parse_tree()
# Test that getting captures doesn't hang
parser.get_captures()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you assert anything about the last two operations?

Also, if either of them can hang on an infinite loop, would it be worth having two tests, to identify which one hit the loop (although that may be more than needed).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reserved making asserts for some of the other tests, but I'll have a think 😄

get_captures actually calls parse_tree under the hood BUT it can also hang even if parse_tree doesn't. I can maybe put another mock in here to disentangle them a bit..?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't worry about injecting another mock. I was just wondering if there was any visible state change which could be asserted after each step?

A -> A
A ->

Loop occurs because `A -> A` is a nullable rule
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens with the JSON 'linked list' here? Doesn't that come down to something like this?

Copy link
Collaborator Author

@hudson-ai hudson-ai Jun 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. This is pretty subtle and took me a while to understand.

The grammar

A -> A 'x'
A ->

is totally fine because the first rule isn't nullable by the definition I gave above because 'x' isn't nullable -- only A is.

I.e.

zero_or_more('x') is fine; zero_or_more(optional('x')) isn't.

The latter looks like

A -> A B
A ->
B -> 'x'
B ->

Linked lists look more like the former.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose the fact that the existing JSON tests passed is a good sign, but I'm still trying to work out these grammars.

# Test that computing the parse tree doesn't hang
parser.parse_tree()
# Test that getting captures doesn't hang
parser.get_captures()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For all of these tests, it would be nice if there was something to assert about the state of the parser at the end.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Will keep thinking about this!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Working on making some assertions here, I became dissatisfied with the fact that my fix doesn't actually prevent infinite parse trees like A -> A -> ...; it only prevents looping over these parse trees forever. The behavior is right in that we're still able to correctly get captures, so maybe this is fine. But marking this as a draft in the meantime while I see if I can get a bit closer to the root of the issue.

@hudson-ai hudson-ai changed the title Fix infinite loop when computing parse tree for recursive nullable grammars [WIP] Fix infinite loop when computing parse tree for recursive nullable grammars Jun 3, 2024
@hudson-ai
Copy link
Collaborator Author

Currently, my fix is to avoid infinitely recursing over children, but I suspect that it might be nicer to prevent accepting parses that have infinitely looping children to begin with. That being said, the behavior seems correct (in that captures are correctly being... captured).

@slundberg I would really appreciate some input from you on this as well as what the correct semantics of partial captures should be (I tried to preserve existing behavior, but I'm not entirely clear on whether that behavior was the intended one).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CPU Bound Issue in Parser with Complex Grammar (Possible Error with handling of Zero-Width/empty Strings)
4 participants